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"^j" Abstract 

" We present an efficient numerical method for computing Hamiltonian matrix elements between non-orthogonal Slater determi- 
^ nants, focusing on the most time-consuming component of the calculation that involves a sparse array. In the usual case where 
many matrix elements should be calculated, this computation can be transformed into a multiplication of dense matrices. It is 
demonstrated that the present method based on the matrix-matrix multiplication attains ~80% of the theoretical peak performance 
| measured on systems equipped with modern microprocessors, a factor of 5-10 better than the normal method using indirectly 
. .indexed arrays to treat a sparse array. The reason for such different performances is discussed from the viewpoint of memory 
l access. 
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1. Introduction 

One of the main issues in the quantum many-body problem 
is solving a Schrodinger equation to good accuracy in reason- 
able computational time. While mean-field solutions such as 
that of the Hartree-Fock method are very successful in various 
systems, the inclusion of effects beyond the mean field, i.e., cor- 
relation, is highly desired for better description. For instance, 
the mean-field wave function does not necessarily have a good 
quantum number that is conserved in the exact solution such as 
the total angular momentum. 

A superposition of Slater determinants is the usual way to 
overcome the limitation of the mean-field method. Among var- 
ious schemes to represent a correlated wave function, a rep- 
resentation by non-orthogonal Slater determinants (or quasi- 
particle vacua in general) is a method which is widely used 
in the nuclear many -body problems U]. The method, often 
associated with the generator coordinate method (GCM) J2l, 
has been successfully applied, for instance, to the description 
of collective motion and to the restoration of broken symmetry 
J3I]. Furthermore, the use of non-orthogonal Slater determinants 
has recently opened a new possibility for representing a precise 
many-body wave function in an efficient way, as demonstrated 
by the Monte Carlo shell model (MCSM) [4], variants of the 
VAMPIR method H], and a hybrid method between MCSM and 
VAMPIR 0]. The MCSM method is now capable of precisely 
evaluating the eigenvalues even for a system beyond exact cal- 
culation by introducing a novel extrapolation method utilizing 



the variance of energy |01. There have been some studies using 
the superposition of non-orthogonal Slater determinants also in 
quantum chemistry IMI^ floll . 

In the present paper, in order to extend the applicability of the 
expression of non-orthogonal Slater determinants, we present a 
numerical method for efficiently computing Hamiltonian matrix 
elements between them. Since we assume a general two-body 
force that has the rotational symmetry only, the present method 
will be applicable to various systems. This paper is organized 
as follows. Section [2] briefly describes the many -body system 
and many -body wave function under consideration. Section [3] 
presents some numerical methods for computing the most time- 
consuming part. In Sec. [4] the computational performances of 
the presented methods are compared, and the reason for their 
differences in performance is discussed. In Sec. [5] we summa- 
rize this paper. 

2. Many-body calculation with non-orthogonal Slater de- 
terminants 

In this paper, we consider the many-body system described 
by the Hamiltonian consisting of a one-body operator T and a 
two-body operator V, 
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H = T + V = ^ thhc]c h + - ^ v hhhu c\c\c u c h , (1) 
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where c, and q are the creation and annihilation operators of 
the state labeled by I, respectively. The one-body matrix ele- 
ments f/,/, are given by f;,/, = (/1ITI/2), and the two-body ma- 
trix elements defined by v/,/ 2 ,/ 3 / 4 = (hlz\V\hh) - {/1/2IVIZ4/3) 
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are antisymmetrized: vi^j^ - — V/i/ 2 ,w 3 . We consider a model 
space consisting of a finite number of single-particle orbits rep- 
resented by N s , and regard a set of the single-particle wave 
functions <pi(x) = (x\c) |— ) (I = 1,2,..., N s ) as a single-particle 
basis set. 

We approximate the solution of Eq. ([TJi by a superposition of 
a finite number of non-orthogonal Slater determinants 



(2) 



where |<£>(g)) and f(q) denote a Slater determinant and its am- 
plitude, respectively. Note that although the wave function |*F) 
is sometimes expressed by a continuous superposition over q as 
is expressed by the GCM, the actual numerical calculation is 
usually performed by the discretization shown in Eq. (0. Each 
Slater determinant, regarded as a many-body basis state, is rep- 
resented by a product of generalized creation operators 



\<$>{q)) =\\a\{q)\-) 



(3) 



where N p is the number of particles, and the creation operator 
a\[q) is given by 



a[{q) = ^ D{q) lk c) 



(4) 



Here the N s xN p matrix (N s > A^) D(q) characterizes the many- 
body basis state \<f>(q)) . In general, the basis states \<f>(q)) are 
non-orthogonal between one another: (<&(q')\0(q)) + 0. Al- 
though an important issue in quantum many -body theory is how 
to choose good \$>(q)), we do not mention it here because the 
aim of this paper is to present an efficient computational method 
which is valid for any calculation of the same type. Once a set 
of the many-body basis states is fixed, one needs to optimize a 
set of amplitudes f(q). This optimization is usually carried out 
with the variational principle: 
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which leads to the Hill- Wheeler equation 101 for a discretized 
coordinate q 

Wf = ENf, (6) 
where *H and N are matrices whose elements are given by 

<H{q',q) = (Qtf)\H\<Kq)) (7) 
N(q',q) = <<D(?0|<D( 9 )>. (8) 

/ is a vector whose component is f(q), and E is the eigenvalue. 
Following the terminology of the GCM, we hereafter call the 
many-body matrix elements of H and N the Hamiltonian kernel 
and the norm kernel, respectively, to avoid confusing them with 
the two-body matrix element v>/,/,,/,/ 4 of a single-particle basis. 
Both the kernels are represented by D(q) and D(q'). The norm 
kernel is written as 



N(q',q) = det(D(q') f D(qj), 



(9) 



and the Hamiltonian kernel is 

( N. 

<H(q',q) = N(q',q) 



N s ^ N, 

^ %hPhl\ + 2 XI Phh%k,hhPkh 



(10) 



using the density matrix p whose matrix element is defined by 



Pw 



Using D(q) and D(q'), the density matrix is given by 



p = D(q)(D(q') f D(q))~ 1 D(q')i 



(11) 



(12) 



Among various applications of the above expression is the 
restoration of broken symmetries. Since a general Slater deter- 
minant of Eq. (0 does not necessarily possess the symmetries 
that the original Hamiltonian has, it is desirable to restore the 
broken symmetries by projecting the wave function onto good 
quantum numbers. The total angular momentum, for instance, 
is restored from |0) by performing a three-dimensional integra- 
tion over the Euler angles To carry out a numerical integra- 
tion, the number of mesh points for the Euler angles is required 
to be as many as the order of 10 4 , as are the numbers of "Hiq 1 ', q) 
and N(q', q) to be calculated QJJ] . 

As thus exemplified, innumerable Slater determinants are of- 
ten involved to obtain a good many-body wave function l^f). 
Hence, fast computation of the Hamiltonian and norm ker- 
nels will accelerate the whole calculation. The most time- 
consuming in the above procedure is the computation of the 
two-body part of the Hamiltonian kernel 
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with 
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(14) 



because such computation requires a fourfold summation over 
the single-particle states. In the following sections, we concen- 
trate on an efficient computational method for Eq. ( TT3l on sys- 
tems equipped with modern microprocessors. We assume that 
the operation of Eq. ( fT3l is repeated a great number of times for 
different density matrices p under the condition of fixed two- 
body matrix elements vi l i 1 ,i 3 i 4 - 

3. Numerical methods for computing the Hamiltonian ker- 
nel 

A straightforward operation of Eq. (TT~3T > is in general a waste 
of computational time because v/,/ 2 ,/ 3 / 4 is very sparse. This 
sparseness is due to the symmetry of the Hamiltonian. For in- 
stance, the conservation of the z component of the angular mo- 
mentum leads to v/ lW3 ; 4 = unless Mh)+j z (h) = j z (h)+jz(h) 
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is satisfied. Depending on the system considered, some other 
symmetries such as parity, orbital angular momentum, and 
isospin quantum numbers are also conserved, which imposes 
further constraints on the non-zero matrix elements. Hence, ev- 
ery effort must be made to avoid taking those vanishing matrix 
elements for efficient computing. Below we show three numer- 
ical algorithms for this purpose. The first method is completely 
different from the other two, and the last method is more ad- 
vanced than the second method. 

Indirect-index method 

As shown in the last paragraph, the operation associated with 
zero for (V) is mainly caused not by the density matrix but by 
the fixed two-body matrix elements Vi 1 ;, i ; 3 ; 4 . Thus, it is useful to 
classify in advance the indeces 1%, h, h) of v/,/ 2i ; 3 ; 4 according 
to whether they lead to non- vanishing v/,; 2; ; 3 ; 4 , and to label the 
set of indeces (l\,l2,h,h) satisfying this condition with a so- 
called indirect index k as (l\{k), h{k), h(k), h{k)). Equation ( fT3l ) 
is then represented as 

00 = 2j Ph(k)h(k)Vnoawm{k)PU(k)h(k), (15) 

where N nonzao is the number of non-vanishing v/,/,^, and 
^nonzero (k) = vi^hgcu^im + 0. When v/,; 2> ; 3 / 4 is sparse, 
Mionzero is much smaller than N A S . In this paper, we refer to 
the numerical algorithm based on Eq. (TT~5T > as the indirect-index 
method. 

Matrix-vector method 

Although the introduction of the indirect index can always be 
applied to the computation of sparse arrays, here we present an 
alternative numerical approach which directly utilizes the sym- 
metry. We now assume that the two-body force V has only the 
rotational invariance for simplicity. Other possible symmetries 
can be treated in a similar way. 

First, N s x N s density-matrix elements pu> are grouped ac- 
cording to Am = j z {l') - j z (l), and the set of {I, I') having a com- 
mon Am is indexed by k = 1,2,..., Nt, m as p(Am)k- In a similar 
way, the two-body matrix elements Vi^j^ are categorized ac- 
cording to Am I3 = j z (h) - j z (h) and Am 24 = j z (h) - j z (h) as 
v(Ami3, Ani24)k'k, where k' and k are, respectively, indeces to 
(h,h) and (h,U) having A»?i3 and Ani2A- Equation cTT~3T > then 
leads to 

(V) = ^ ^p(Am n ) k ,v(Ami3,Am 2 4)k>kP(Am24)k 

Am 13 A/7724 k'k 

= p(-Am)k>v(-Am, Am)k>kP(Am)k, 

Am k'k 

where the last equation of Eq. ( TT6b is derived from the necessary 
condition for Vi^j^ being non-zero: j z (h) + j z (h) = j z (h) + 
j z (l 4 ), i.e., Am 13 = j z {h) - j z (l 3 ) = -(j z (h) - j z (h)) = -Am 24 = 
-Am. 

Since the density matrix p(Ara) and the two-body matrix 
v(-Am, Am) for a given Am are a one-dimensional array and a 
two-dimensional array, respectively, they can be identified with 



Am= -1 +1 
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Figure 1: Schematic illustration of the operation of Eq. 1161 . 



a vector of size Aa«j and a matrix of size Aa m x N^ m , respec- 
tively, by using Na„, = N-A m . Thus, Eq. ( fTo*b is regarded as a 
'(vector)x(matrix)x(vector) operation. This is schematically il- 
lustrated in Fig.Q] It is clearly seen that the sparse array v/,^ ^^ 
is transformed into a block-antidiagonal matrix v whose blocks 
are dense submatrices. In this paper, we refer to the numerical 
algorithm based on Eq. (fTol l as the matrix-vector method. 

Matrix-matrix method 

In the matrix-vector method, most of the computational time 
is devoted to the (matrix)x(vector) operation f = vp, where the 
index of Am is omitted for simplicity. As previously mentioned, 
this operation is usually repeated a number of times for different 
p's: vp (1) , vp (2) , ... By binding vectors p (1) ,p (2) , . . . ,p (A,ra) into 
a matrix 8 = (p (1) ,p (2) , . . . ,p ( - Nvcc) ), repeated (matrix) x (vector) 
operations are performed by a (matrix) x (matrix) operation at 
one time: 

(f (l \ f®, . . . , fW->) = (vp^, yp< 2 \ . . . , vp^) = v6, (17) 

where the number of columns A vec can be chosen arbitrar- 
ily. The (V) for the z'-th density matrix p (/) is then given by 
'p®vp (0 = f p®t® = 'p«(v6>) (0 , where (v8) (i) stands for the z'- 
th column of the matrix v6. We call this method, i.e, the way 
through the (matrix)x(matrix) operation of Eq. ( fTTl i. the matrix- 
matrix method. It seems as if there is no substantial differ- 
ence between the matrix-matrix method and the matrix-vector 
method: equation ( fT7b keeps not only mathematical identity but 
also the number of elementary operations. However, as seen 
in the next section, those two methods result in quite different 
computational performances on actual computer systems. 

4. Measurement of performance 

In this section, computational performance is compared 
(Kpmong the three methods presented in the last section by adopt- 
ing a realistic many-body system and measuring the elapsed 
time to compute Eq. ( fT3l ) repeatedly. 

4. 1. Benchmark system 

Here we consider a nuclear many-body problem where pro- 
tons and neutrons interact in a fixed model space. We adopt 
a set of the single-particle orbits consisting of five harmonic- 
oscillator major shells from harmonic-oscillator's quantum 
number N = to 4: Osi/2, 0p3/2, Opi/2, Oc/5/2, (W3/2, lsi/2, 
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0/7/2, 0/5/2, l/>3/2, lpi/2, 0g9/2, 0g 7 /2, 1^5/2, 1^3/2, and 2si/ 2 . 

Thus, the number of the proton (neutron) single-particle states 
N s is 70. Here, the proton and neutron numbers are set to be two 
and two, respectively, but the number of particles is irrelevant 
to the computational time of Eq. dT3T> . 

The two-body part of the adopted Hamiltonian is an arbi- 
trary one that has rotational, parity and time-reversal symme- 
tries. Due to the rotational and time-reversal symmetries, all 
the matrix elements v>/,/,,/,/ 4 can be real numbers [11]. Since we 
do not assume other symmetries such as an isospin, we calcu- 
late the proton-proton interaction part of Eq. ( [T3l i. the neutron- 
neutron part, and the proton-neutron part independently. For 
this system, the largest submatrix used in the matrix- vector (or 
matrix-matrix) method is of the size 390 x 390, classified ac- 
cording to the z component of the angular momentum and the 
parity. 

The wave function taken is a single Slater determinant with 
total angular-momentum and parity projection. Each single- 
particle state of the Slater determinant is assumed to be a pure 
proton or neutron state. The number of mesh points for the 
three Euler angles and that for the parity projector are 25 3 and 
2, respectively, leading to 25 3 x 2 = 31, 250 times the computa- 
tions of Eq. ( fT3l . Since a rotation of a wave function involves 
imaginary numbers yl, the density matrix has to be complex. 

It would be useful to compare the number of elementary 
floating-point operations (addition and multiplication) among 
the three methods. Taking into account that the loop length of 
Eq. (fT3l l can be halved by using v/,/ 2j ; 3 ; 4 = Vy,^;,, the number 
of elementary floating-point operations becomes 20,992,5 1 8 for 
the indirect-index method and 10,365,224 for the matrix-vector 
method and the matrix-matrix method. The former is almost 
the double of the latter as explained as follows. In the matrix- 
vector method, (yp\i - Vkrip\ + Vktipi + • ■ • is factored out of 
Zk'kPk'Vk'kPk in the way £*< p k >(vk>ipi + h'lPi + ...). This ex- 
pression saves the number of multiplications, and more impor- 
tantly, the reduced operations are the multiplication of complex 
numbers which costs as many as six floating-point operations. 

4.2. Computational environment 

The computation is carried out as a single-threaded process 
on two different systems based on up-to-date scalar processors: 
one system is based on the Xeon E5570 processor with clock 
speed 2.93 GHz and the other is based on the SPARC64 VII 
processor with clock speed 2.5 GHz. Their theoretical peak per- 
formances per CPU core are 1 1.72 GFLOPS and 10 GFLOPS, 
respectively. Our code written in Fortran 90/95/2003 is com- 
piled by Intel Fortran Compiler Version 11.1 for the Xeon sys- 
tem and by Fujitsu Fortran Compiler Driver Version 8.2 for the 
SPARC64 system. The two-body matrix elements v/,;, ;,^ and 
the density matrix elements p«< are of double-precision. Matrix 
and/or vector calculations are coded to call the BLAS interface 
(BLAS lfl2ll is the de facto standard for the programming in- 
terface of basic linear algebra operations). We use optimized 
BLAS implementations: Intel Math Kernel Library (MKL) for 
the Xeon system and Fujitsu Scientific Subroutine Library II 
(SSL II) for the SPARC64 system. The computational perfor- 
mance for executing Eq. ( fL3l is measured with the wall-clock 
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Figure 2: Comparison of the computational performance among the indirect- 
index method (Ind.), matrix-vector method (M-V) and matrix-matrix method 
(M-M) with different N vcc measured on the SPARC64 VII and Xeon E5570 
systems. The values are normalized by their theoretical peak performance. See 
the text for more details. 



time at a microsecond-level resolution, which is good enough 
for the present purpose. 

4.3. Results and analyses 

The performance of a computation is characterized by the in- 
version of the wall-clock time t. It is comprehensive to express 
the performance in FLOPS which is r l in 1/second multiplied 
by the total number of elementary floating-point operations ex- 
ecuted. But since the number of operations is different among 
the methods as shown previously, FLOPS is not a good mea- 
sure for comparing their relative performances. Hence, to make 
direct comparison possible, the performance is now defined by 
t multiplied by a fixed factor of the number of elementary 
floating-point operations of the matrix-vector (or the matrix- 
matrix) method, only when it serves as the actual FLOPS. 

Figure [2] compares the measured performances normalized 
by the theoretical peak performances of the adopted systems. 
The indirect-index method gives the lowest performance for 
both systems. The performance of the matrix- vector method 
is about twice as high as that of the indirect-index method. 
This is almost equivalent to the ratio of the numbers of floating- 
point operations, but is still far from the theoretical peak perfor- 
mance. When p vectors are bound into a matrix in the matrix- 
matrix method, the performance starts to increase. The perfor- 
mance improves sharply even at a small A^ ve c, and is saturated at 
aroundA^vec ~ 30-100 to reach ~70-80 % of the theoretical peak 
performance. The values of the two systems are very close at 
a large N vec in contrast to rather different behavior at a smaller 

Although the matrix-vector and matrix-matrix methods are 
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identical in mathematics, they are quite different in perfor- 
mance. Memory access, the major bottleneck of modern com- 
puter systems, differentiates the methods from each other. Now 
we consider a matrix of size n x n and a vector of size n and 
estimate the number of arithmetic operations and memory ac- 
cesses involving them. Since a matrix-times-vector operation 
needs 2« 2 floating-point operations and n 2 + n memory ac- 
cesses, the computational intensity defined by their ratio is ~ 2. 
On the other hand, the computational intensity for a matrix- 
times-matrix operation becomes n, much larger than that of the 
matrix-vector operation for a sufficiently large n. More specifi- 
cally, the matrix-times-matrix operation can be designed so that 
most of the CPU time can be involved in arithmetric operations 
rather than memory access as is implemented in numerical li- 
braries such as MKL and SSL II. See, for instance, H for 
more detailed analyses of the performance of basic linear alge- 
bra operations in terms of computer architecture. 

Finally, we consider the performance of parallel processes. 
We take an example where the 31,250 'pvp operations are di- 
vided into 32 MPI processes running on a 4 node x 2 CPU x 
4 core Xeon E5570 system. The matrix-matrix method with 
Avec = 100 reaches 8.5 GFLOPS/core, which is rather close to 
the 9.1 GFLOPS achieved by the single process. In contrast, 
for the matrix-vector method, the parallel performance is re- 
duced to 1 .5 GLOPS/core from the single-process performance 
of 3.1 GFLOPS. This difference is also accounted for by the 
memory access: since the memory bandwidth is shared by all 
the CPU cores on the board, the effective bandwidth defined by 
the bandwidth per process or thread is reduced for parallel pro- 
cesses. This reduction of the effective bandwidth leads to the 
reduction of the performance particularly for the processes in- 
volving heavy memory access like the matrix-vector operation. 
Thus, the matrix-matrix method is superior to the matrix-vector 
method not only in absolute performance but also in parallel 
efficiency because of less memory demanding formalism. 

5. Summary 

We have presented an efficient numerical method for com- 
puting Hamiltonian matrix elements between non-orthogonal 
Slater determinants, motivated by recent findings that a super- 
position of non-orthogonal Slater determinants is a very ef- 
fective way to solve a many-body problem. The most com- 
putationally demanding is the computation of a four-fold loop 
(V) = YjUhhkPhh %hhUPUiv where v /lW3 / 4 is a sparse array 
due to the symmetries of the Hamiltonian. While indirectly 
indexed arrays are often introduced for treating a sparse ma- 
trix, the performance of the method has been measured to be 
much lower than the theoretical peak performance. In order to 
fit a formula of (V) to fast computation, its key part is trans- 
formed into a multiplication of a dense matrix and a vector for 
a single (V) calculation, and also into a multiplication of dense 
matrices for multiple (V) calculations. The method based on 
the matrix-matrix multiplication attains as much as ~80% of 
the theoretical peak performance on actual systems. Its high 
performance is accounted for by its high computational inten- 
sity, i.e., a large ratio of floating-point operations to memory 



accesses. Since from the hardware side it is predicted that the 
Byte/FLOP rate of future systems will be decreased 11411 be- 
cause of rapid increase of the number of CPU cores compared 
to memory bandwidth, numerical methods should be developed 
so that the computational intensity can be higher as achieved by 
the present method. 
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